Higher-Order Asymptotics and Its Use to Test the Equality of the Examinee Ability Over Two Sets of Items


Abstract 

In educational and psychological measurement, researchers and/or practitioners are 
often interested in examining whether the ability of an examinee is the same over two 
sets of items. Such problems can arise in measurement of change, detection of cheating 
on unproctored tests, erasure analysis, detection of item preknowledge, etc. Traditional
frequentist approaches that are used in such problems include the Wald test, the likelihood 
ratio test, and the score test (e.g., Fischer, 2003; Finkelman, Weiss, & Kim-Kang, 2010; 
Glas & Dagohoy, 2007; Guo & Drasgow, 2010; Klauer & Rettig, 1990; Sinharay, 2017). This 
paper shows that approaches based on higher-order asymptotics (e.g., Barndorff-Nielsen & 
Cox, 1994; Ghosh, 1994) can also be used to test for the equality of the examinee ability 
over two sets of items. The modified signed likelihood ratio test (e.g., Barndorff-Nielsen,
1986) and the Lugannani-Rice approximation (Lugannani & Rice, 1980), both of which
are based on higher-order asymptotics, are shown to provide some improvement over the
traditional frequentist approaches in three simulations. Two real data examples are also
provided.


Key words: detection of cheating, item preknowledge, measurement of change. 


In applications of item response theory (IRT), measurement researchers and 
practitioners are often interested in examining whether the ability of an examinee is the
same on two sets of items. The following are examples of such problems:


1. Two sets of items are answered at two different time-points and the investigator is 
interested in examining whether the examinee’s ability changed between the two time 
points. Researchers such as Fischer (2003) and Finkelman et al. (2010) showed that 
if the two sets of items were calibrated using an IRT model on the same scale, one 
can perform a statistical hypothesis test to determine whether the examinee ability is 
equal at the two time points. A significant test statistic would indicate a change in
ability.


2. A set of items on an assessment is not known to have been compromised and the
remaining items are known to have been compromised. In this case, the investigator may
want to test for the equality of the examinee ability over these two sets of items
(compromised and non-compromised) to detect if the examinee cheated, that is, benefited
from preknowledge of the compromised items. A significantly larger estimated ability
on the compromised items may indicate cheating. See Sinharay (2017) for such an
example. A similar problem involves two sets of items where the first set includes
field-test or pretest items (which have not been used before) and the second includes
operational items (which have been used before). Better performance by an examinee
on the second set of items may indicate possible cheating. See, for example, Drasgow,
Levine, and Zickar (1996), for a case like this.


3. A set of items is answered in a proctored version of the assessment and another set is 
answered in an unproctored version of the assessment; the investigator may want to 
test for the equality of the examinee ability over these two assessments to detect if the 
examinee cheated on the unproctored version. See Guo and Drasgow (2010) for such
an example.


4. No erasures were found on the answer sheet on a set of items and erasures were found
on the answer sheet on the remaining items; the investigator may want to test for the
equality of the ability over these two sets of items to detect if the examinee benefited
from cheating in the form of fraudulent erasures. See Wollack, Cohen, and Eckerly
(2015) for such an example.


5. The two sets of items belong to two subsections/subtests of an assessment and the 
investigator wants to assess person fit by testing the equality of the ability over the 
two subsections. See Glas and Dagohoy (2007), Klauer and Rettig (1990), and Klauer
(1991) for such examples.


In such applications, the investigator is often interested in testing against a one-sided 
alternative hypothesis because of an interest in detecting cheating on assessments. For 
example, Wollack and Schoenig (2018) categorized the statistical methods to detect 
cheating (on tests) into six categories, one of which is “score differencing”; this category of
methods essentially involves a test of the hypothesis of equal ability of an examinee (or
a group of examinees) over two sets of items against a one-sided alternative hypothesis.
Cheating would lead to a better performance on the second set of items compared to the
first set of items. A one-sided alternative hypothesis may be more appropriate in the second,
third, and fourth of the above examples and also in the first example if the investigator 
is only interested in a positive change of ability because of, for example, cheating on the 
assessment (e.g., Lewis & Thayer, 1998, stated that a large positive score change often 
initiates further investigation on possible cheating by an examinee). This paper focuses on 
one-sided alternatives. 

Traditionally, researchers and practitioners in educational and psychological 
measurement have applied the Wald test, the likelihood ratio test (LRT), or the score test 
to test for the equality of the ability over two sets of items. See, for example, Fischer 
(2003), Finkelman et al. (2010), Glas and Dagohoy (2007), Guo and Drasgow (2010), 
Klauer and Rettig (1990), and Sinharay (2017). In some of the example problems cited 
above, researchers have used other methods to test the hypothesis, but those methods have
been limited to only one context or one IRT model. For example, Wollack et al. (2015)
used a test statistic called “erasure detection index” in the context of erasure analysis and
Klauer (1991) used a uniformly most powerful test under the Rasch model in the context
of testing for the equality of abilities over two subsections. These specific methods are not
considered henceforth. Instead, the focus will be on the Wald test, the LRT, and the score 
test, which are often referred to as methods based on first-order asymptotics (e.g., Brazzale, 
Davison, & Reid, 2007, p. 1). 

The Wald test, the LRT, and the score test are appropriate for large samples, which, in 
the context of testing of the equality of the ability over two sets of items, means that these 
tests are appropriate when they are based on two large sets of items. However, the test of 
the equality of abilities often has to be performed using at least one small or moderately 
large set of items. For example, in erasure analysis (e.g., Wollack et al., 2015), the set of 
erased items is typically small (often consisting of fewer than 10 items). When one or both
sets of items are small, the methods based on first-order asymptotics are often not
appropriate and can have an inflated Type I error rate or low power; for example, Guo and
Drasgow (2010) found the Type I error rate of the Wald test to increase as the proctored
test became shorter. Thus, there is scope for further research on hypothesis testing
methods that would perform better than those based on first-order asymptotics.

One set of large-sample approaches that has been found to perform better than
the methods based on first-order asymptotics in hypothesis testing based on small or
moderately sized samples in several areas of statistics involves higher-order asymptotics (e.g.,
Barndorff-Nielsen & Cox, 1994; Ghosh, 1994). Specifically, the modified signed likelihood
ratio test (MSLRT; e.g., Barndorff-Nielsen, 1986) and the Lugannani-Rice approximation
(LRA; Lugannani & Rice, 1980), both of which are based on higher-order asymptotics,
have been used in hypothesis-testing problems in several areas of statistics and have been
proved, theoretically and empirically, to have better properties compared to the Wald test,
the LRT, and the score test (e.g., Barndorff-Nielsen, 1991; Pierce & Peters, 1992). While
the MSLRT and the LRA usually perform similarly to the methods based on first-order
asymptotics for very large samples, the former methods often perform considerably better
than the latter methods for samples that are not very large (e.g., Barndorff-Nielsen, 1991).


Higher-order asymptotics have found a few applications in educational and/or 
psychological measurement. Bedrick (1997) and von Davier and Molenaar (2003) applied 
Edgeworth expansions, which are based on higher-order asymptotics, to obtain the 
distributional form of person-fit indices for the dichotomous and polytomous Rasch 
models. Biehler, Holling, and Doebler (2015) suggested a saddlepoint approximation of 
the distribution of the ability parameter for the two-parameter logistic model (2PLM). 
However, there are no known applications of higher-order asymptotics to test the equality
of abilities in the context of educational and/or psychological measurement. Thus, this
paper attempts to fill an important void in the literature. 

The next section includes a literature review—the existing tests for the equality of the 
abilities over two sets of items are discussed followed by a review of the MSLRT and LRA. 
The Methods section includes the derivations of the MSLRT and LRA for the 2PLM and 
the generalized partial credit model (GPCM; Muraki, 1992). The MSLRT and LRA are 
compared to the Wald test, the LRT-based test, and the score test using simulated data 
sets in the Simulation section. Two real data examples are included in the penultimate 
section. Conclusions and recommendations are provided in the last section. 

In tests of hypotheses regarding the examinee ability in IRT applications, the item
parameters are usually assumed known rather than estimated (e.g., Glas & Dagohoy, 2007).
The same assumption of known item parameters was made here, one reason being that 
Glas and Dagohoy (2007) found that accounting for the uncertainty in the (estimated) item 
parameters had little impact on the testing of the equality of abilities. 


Background 
Joint Likelihood Over Two Sets of Items 


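Let us consider two sets of binary items that have been calibrated using the 2PLM.
Let the true slope and difficulty parameters of the first set of items be denoted by $a_i$'s and
$b_i$'s, respectively, where $i = 1, 2, \ldots, n_1$, and let the true slope and difficulty parameters of
the second set of items be denoted by $a_j$'s and $b_j$'s, respectively, where $j = 1, 2, \ldots, n_2$. Let
the true ability of a randomly chosen examinee on the two sets of items be denoted by $\theta_1$
and $\theta_2$, respectively, and let the examinee's scores on the two sets of items be denoted by
$X_i, i = 1, 2, \ldots, n_1$ and $Y_j, j = 1, 2, \ldots, n_2$, respectively. Let us denote the probability of a
correct answer under the 2PLM on the two sets of items as

$$
P(X_i = 1) = \frac{\exp[a_i(\theta_1 - b_i)]}{1 + \exp[a_i(\theta_1 - b_i)]} = p_i(\theta_1)
\quad \text{and} \quad
P(Y_j = 1) = \frac{\exp[a_j(\theta_2 - b_j)]}{1 + \exp[a_j(\theta_2 - b_j)]} = p_j(\theta_2).
$$

The joint log-likelihood $\ell(\theta_1, \theta_2)$ of the two true abilities for an examinee is given by

$$
\begin{aligned}
\ell(\theta_1, \theta_2)
&= \sum_i \left[ X_i \log p_i(\theta_1) + (1 - X_i) \log(1 - p_i(\theta_1)) \right]
 + \sum_j \left[ Y_j \log p_j(\theta_2) + (1 - Y_j) \log(1 - p_j(\theta_2)) \right] \\
&= \sum_i \left[ X_i \log \frac{p_i(\theta_1)}{1 - p_i(\theta_1)} + \log(1 - p_i(\theta_1)) \right]
 + \sum_j \left[ Y_j \log \frac{p_j(\theta_2)}{1 - p_j(\theta_2)} + \log(1 - p_j(\theta_2)) \right] \\
&= \sum_i \left[ X_i a_i (\theta_1 - b_i) + \log(1 - p_i(\theta_1)) \right]
 + \sum_j \left[ Y_j a_j (\theta_2 - b_j) + \log(1 - p_j(\theta_2)) \right] \\
&= S_1 \theta_1 - \sum_i X_i a_i b_i + \sum_i \log(1 - p_i(\theta_1))
 + S_2 \theta_2 - \sum_j Y_j a_j b_j + \sum_j \log(1 - p_j(\theta_2)),
\end{aligned}
\tag{1}
$$

where $S_1 = \sum_i X_i a_i$ and $S_2 = \sum_j Y_j a_j$.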
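As a minimal sketch of Equation 1, the following Python functions evaluate the joint log-likelihood in two ways: directly from the item-level Bernoulli terms, and through the final form that uses the weighted scores $S_1$ and $S_2$. The function names and the representation of items as `(a, b)` pairs are assumptions made for this illustration only.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct answer under the 2PLM."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def joint_loglik(theta1, theta2, x, items1, y, items2):
    """Joint log-likelihood from the first line of Equation 1.

    x, y are lists of 0/1 item scores; items1, items2 are lists of (a, b) pairs.
    """
    ll = 0.0
    for xi, (a, b) in zip(x, items1):
        p = p_2pl(theta1, a, b)
        ll += xi * math.log(p) + (1 - xi) * math.log(1.0 - p)
    for yj, (a, b) in zip(y, items2):
        p = p_2pl(theta2, a, b)
        ll += yj * math.log(p) + (1 - yj) * math.log(1.0 - p)
    return ll


def joint_loglik_weighted_scores(theta1, theta2, x, items1, y, items2):
    """Same quantity from the last line of Equation 1 (S1 and S2 form)."""
    s1 = sum(xi * a for xi, (a, _) in zip(x, items1))
    s2 = sum(yj * a for yj, (a, _) in zip(y, items2))
    # Terms that depend on the data and item parameters but not on the abilities.
    data_only = -sum(xi * a * b for xi, (a, b) in zip(x, items1))
    data_only -= sum(yj * a * b for yj, (a, b) in zip(y, items2))
    # Terms that depend on the abilities but not on the data.
    param_only = sum(math.log(1.0 - p_2pl(theta1, a, b)) for a, b in items1)
    param_only += sum(math.log(1.0 - p_2pl(theta2, a, b)) for a, b in items2)
    return s1 * theta1 + s2 * theta2 + data_only + param_only
```

The two functions should agree up to floating-point error, which gives a quick check that the algebra leading to the last line of Equation 1 has been transcribed correctly.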


Existing Methods for Testing the Equality of Ability Over Two Sets of Items 


Primarily, three hypothesis testing approaches have been suggested for testing the
equality of the true ability on two sets of items, that is, for testing $H_0: \theta_1 = \theta_2$ in the
context of IRT models: the Wald test, the LRT, and the score test. Researchers such as
Fischer (2003), Finkelman et al. (2010), and Klauer and Rettig (1990) have suggested
the use of the Wald test for testing the hypothesis. Let $\hat{\theta}_1$ and $\hat{\theta}_2$ denote the maximum
likelihood estimates (MLE) of $\theta_1$ and $\theta_2$, respectively. Let $\hat{\theta}_0$ denote the MLE of the
common ability parameter computed from the examinee's scores on the two sets of items
(combined). The Wald test statistic (that leads to the Wald test or the Z-test) is given by

$$
Z = \frac{\hat{\theta}_2 - \hat{\theta}_1}{\sqrt{s_1^2(\hat{\theta}_0) + s_2^2(\hat{\theta}_0)}},
\tag{2}
$$

where $s_1^2(\hat{\theta}_0)$ and $s_2^2(\hat{\theta}_0)$ are the estimated variances of $\hat{\theta}_1$ and $\hat{\theta}_2$, respectively, both of which
are computed at $\hat{\theta}_0$.¹ For the 2PLM,

$$
s_1^2(\theta) = \left[ \sum_i a_i^2 \, p_i(\theta)\,(1 - p_i(\theta)) \right]^{-1}
= \left[ \sum_i a_i^2 \frac{\exp[a_i(\theta - b_i)]}{(1 + \exp[a_i(\theta - b_i)])^2} \right]^{-1},
\tag{3}
$$

with $s_2^2(\theta)$ defined analogously for the second set of items. For long tests, the $Z$ statistic
approximately follows a standard normal null distribution
and its square follows a $\chi^2$ null distribution with one degree of freedom (Guo & Drasgow,
2010; Finkelman et al., 2010; Klauer & Rettig, 1990). Under the alternative $H_1: \theta_2 > \theta_1$, $Z$
is expected to be a large positive number.
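The Wald statistic of Equation 2 can be sketched in a few lines of Python. The sketch below makes several assumptions that are not spelled out at this point in the text: the MLE is found by bisection on the score function and restricted to the interval [-4, 4], the variance of each ability estimate is taken to be the inverse of the 2PLM test information evaluated at the common MLE, and all function names are hypothetical.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct answer under the 2PLM."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def mle_theta(scores, items, lo=-4.0, hi=4.0, tol=1e-10):
    """MLE of ability by bisection on the score function, restricted to [lo, hi].

    The score function sum_i a_i (u_i - p_i(theta)) is decreasing in theta,
    so its root (if any) in [lo, hi] is the MLE; otherwise the MLE is a bound.
    """
    def score(theta):
        return sum(a * (u - p_2pl(theta, a, b)) for u, (a, b) in zip(scores, items))

    if score(lo) <= 0.0:
        return lo
    if score(hi) >= 0.0:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def information(theta, items):
    """2PLM test information: sum of a^2 * p * (1 - p)."""
    return sum(a * a * p_2pl(theta, a, b) * (1.0 - p_2pl(theta, a, b)) for a, b in items)


def wald_z(x, items1, y, items2):
    """Wald statistic with both variances evaluated at the common MLE."""
    t1 = mle_theta(x, items1)
    t2 = mle_theta(y, items2)
    t0 = mle_theta(list(x) + list(y), list(items1) + list(items2))
    s1_sq = 1.0 / information(t0, items1)
    s2_sq = 1.0 / information(t0, items2)
    return (t2 - t1) / math.sqrt(s1_sq + s2_sq)
```

A large positive return value points toward a higher ability on the second set of items; swapping the roles of the two sets flips the sign of the statistic.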

Guo and Drasgow (2010), Finkelman et al. (2010), and Klauer and Rettig (1990)
suggested the use of the LRT for testing $H_0: \theta_1 = \theta_2$. To apply this test, one
computes $L(\theta_1, \theta_2)$, the joint likelihood of the two ability parameters, twice: once at
$\theta_1 = \hat{\theta}_1, \theta_2 = \hat{\theta}_2$ and once at $\theta_1 = \hat{\theta}_0, \theta_2 = \hat{\theta}_0$. Then, the LRT statistic is computed as

$$
\Lambda = \frac{L(\hat{\theta}_0, \hat{\theta}_0)}{L(\hat{\theta}_1, \hat{\theta}_2)}.
\tag{4}
$$

For long tests, $-2 \log \Lambda$ follows a $\chi^2$ null distribution with one degree of freedom (Guo &
Drasgow, 2010; Finkelman et al., 2010).

If the alternative hypothesis is two-sided, the statistic $-2 \log \Lambda$ can be used as it is.
If, however, the alternative hypothesis is one-sided and is given by $H_1: \theta_2 > \theta_1$, Sinharay
(2017) showed that the signed likelihood ratio (SLR) statistic defined as

$$
L_s =
\begin{cases}
\sqrt{-2 \log \Lambda} & \text{if } \hat{\theta}_2 \geq \hat{\theta}_1, \\
-\sqrt{-2 \log \Lambda} & \text{if } \hat{\theta}_2 < \hat{\theta}_1,
\end{cases}
\tag{5}
$$

and suggested by, for example, Cox and Hinkley (1974, p. 315), is more appropriate. The
asymptotic distribution of $L_s$ is standard normal under the null hypothesis (e.g., Sinharay,
2017).
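¹Finkelman et al. (2010) suggested the computation of $s_1^2(\theta_1)$ and $s_2^2(\theta_2)$ using $\theta_1 = \theta_2 = \hat{\theta}_0$. Instead,
one can use $\theta_1 = \hat{\theta}_1$ and $\theta_2 = \hat{\theta}_2$ to perform the Wald test; this variation did not produce results that are
much different in a limited simulation. Therefore, results using $\theta_1 = \theta_2 = \hat{\theta}_0$ are reported in this paper.

In addition to the Wald test and the LRT-based test, the score test (e.g., Rao, 1973,
p. 417) has also been used to test for equal abilities, for example, by Glas and Dagohoy
(2007), Klauer and Rettig (1990), and Sinharay (2017). Denoting $\log(L(\theta_1, \theta_2)) = \ell(\theta_1, \theta_2)$,
the score statistic is given by

$$
R = \left( \left. \frac{\partial \ell(\theta_1, \theta_2)}{\partial \theta_1} \right|_{\theta_1 = \theta_2 = \hat{\theta}_0} \right)^2 s_1^2(\hat{\theta}_0)
 + \left( \left. \frac{\partial \ell(\theta_1, \theta_2)}{\partial \theta_2} \right|_{\theta_1 = \theta_2 = \hat{\theta}_0} \right)^2 s_2^2(\hat{\theta}_0),
\tag{6}
$$

where, for example, $\left. \frac{\partial \ell(\theta_1, \theta_2)}{\partial \theta_1} \right|_{\theta_1 = \theta_2 = \hat{\theta}_0}$ is the first partial derivative of $\ell(\theta_1, \theta_2)$ with respect to
$\theta_1$ at $\theta_1 = \theta_2 = \hat{\theta}_0$. From Equation 1,

$$
\left. \frac{\partial \ell(\theta_1, \theta_2)}{\partial \theta_1} \right|_{\theta_1 = \theta_2 = \hat{\theta}_0}
= S_1 - \sum_i a_i \frac{\exp[a_i(\hat{\theta}_0 - b_i)]}{1 + \exp[a_i(\hat{\theta}_0 - b_i)]}.
$$

The $R$ statistic is asymptotically equivalent to $-2 \log \Lambda$ and has an asymptotic $\chi^2$ null
distribution with one degree of freedom (e.g., Cox & Hinkley, 1974; Rao, 1973). For a
one-sided alternative hypothesis, Sinharay (2017) suggested the signed score (SS) statistic

$$
R_s =
\begin{cases}
\sqrt{R} & \text{if } \hat{\theta}_2 \geq \hat{\theta}_1, \\
-\sqrt{R} & \text{if } \hat{\theta}_2 < \hat{\theta}_1,
\end{cases}
\tag{7}
$$

and showed the asymptotic null distribution of $R_s$ to be the standard normal distribution.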
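The score statistic simplifies slightly when the common MLE $\hat{\theta}_0$ lies in the interior of the parameter space. The short derivation below is a sketch added for illustration rather than a result quoted from the cited sources, and the symbol $g$ is introduced here only for convenience. Because $\hat{\theta}_0$ maximizes $\ell(\theta, \theta)$, the two partial derivatives in Equation 6 sum to zero at $\theta_1 = \theta_2 = \hat{\theta}_0$, so that

$$
\left. \frac{\partial \ell}{\partial \theta_1} \right|_{\hat{\theta}_0}
= - \left. \frac{\partial \ell}{\partial \theta_2} \right|_{\hat{\theta}_0}
\quad \Longrightarrow \quad
R = g^2 \left[ s_1^2(\hat{\theta}_0) + s_2^2(\hat{\theta}_0) \right],
\qquad
g = S_2 - \sum_j a_j \, p_j(\hat{\theta}_0).
$$

Under this assumption, the statistic depends on the data only through how far the weighted score on the second set of items departs from its expected value at $\hat{\theta}_0$.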


Modified Signed Likelihood Ratio Test 


Barndorff-Nielsen (1986) suggested the modified signed likelihood ratio test (MSLRT)
that can be used to test for the equality of two parameter vectors. Barndorff-Nielsen
(1991), Barndorff-Nielsen and Cox (1994), Brazzale et al. (2007), and Reid (2003) provided
accessible descriptions of the MSLRT. Let us consider the simple (and the most directly
applicable to our case) application of the MSLRT when the probability model used to
describe the data involves two parameters: $\psi$ and $\lambda$. Let the (scalar) parameter of interest
be denoted as $\psi$ and let the hypothesis of interest be $H_0: \psi = \psi_0$. Let $\lambda$ denote the (scalar)
nuisance parameter so that one has to estimate $\lambda$ but one is not interested in testing any
hypotheses regarding $\lambda$. This framework subsumes the case of testing of equality of two
abilities for $\psi = \theta_2 - \theta_1$; $\lambda = \theta_1$, or, equivalently, $\theta_1 = \lambda$; $\theta_2 = \psi + \lambda$, and $\psi_0 = 0$.

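Let $\ell(\psi, \lambda)$ denote the logarithm of the joint likelihood of $\psi$ and $\lambda$ for a set of data. Let
$\hat{\psi}$ and $\hat{\lambda}$ jointly maximize $\ell(\psi, \lambda)$ with respect to $\psi$ and $\lambda$, respectively, that is, $\hat{\psi}$ and $\hat{\lambda}$ are
the joint MLEs. Let $j(\psi, \lambda)$ denote the observed information matrix, or, equivalently, the
negative of the matrix of the second derivatives of $\ell(\psi, \lambda)$. That is,

$$
j(\psi, \lambda) =
\begin{pmatrix}
j_{\psi\psi}(\psi, \lambda) & j_{\psi\lambda}(\psi, \lambda) \\
j_{\lambda\psi}(\psi, \lambda) & j_{\lambda\lambda}(\psi, \lambda)
\end{pmatrix}
= -
\begin{pmatrix}
\dfrac{\partial^2 \ell(\psi, \lambda)}{\partial \psi^2} & \dfrac{\partial^2 \ell(\psi, \lambda)}{\partial \psi \, \partial \lambda} \\
\dfrac{\partial^2 \ell(\psi, \lambda)}{\partial \lambda \, \partial \psi} & \dfrac{\partial^2 \ell(\psi, \lambda)}{\partial \lambda^2}
\end{pmatrix},
\tag{8}
$$

where the individual elements of $j(\psi, \lambda)$ are denoted as $j_{\psi\psi}(\psi, \lambda)$, $j_{\psi\lambda}(\psi, \lambda)$, $j_{\lambda\psi}(\psi, \lambda)$, and
$j_{\lambda\lambda}(\psi, \lambda)$.

Let $\ell_p(\psi)$ denote the profile log likelihood that is defined as the maximum value of the
joint log-likelihood of $\psi$ and $\lambda$ when maximized with respect to $\lambda$ for a fixed $\psi$, that is,

$$
\ell_p(\psi) = \max_{\lambda} \ell(\psi, \lambda) = \ell(\psi, \hat{\lambda}_\psi).
\tag{9}
$$

Thus, $\hat{\lambda}_\psi$ is the constrained maximum likelihood estimate of $\lambda$, where $\psi$ has been
constrained to be equal to a fixed value. Note that the maximum profile likelihood estimate
is the same as $\hat{\psi}$, that is,

$$
\max_{\psi} \ell_p(\psi) = \ell_p(\hat{\psi})
\tag{10}
$$

(e.g., Barndorff-Nielsen & Cox, 1994, p. 90).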
To test $H_0: \psi = \psi_0$ against a one-sided alternative $H_1: \psi > \psi_0$, one can use the SLR
statistic or the likelihood-root statistic

$$
r(\psi_0) = \operatorname{sign}(\hat{\psi} - \psi_0) \sqrt{2 \left[ \ell_p(\hat{\psi}) - \ell_p(\psi_0) \right]}
\tag{11}
$$

(e.g., Brazzale et al., 2007, p. 139).
Some algebra shows that the expression for $r(\psi_0)$ for the 2PLM becomes identical to
the expression for the $L_s$ statistic provided in Equation 5. If $H_0$ is true, then $r(\psi_0)$ has a
standard normal asymptotic distribution (e.g., Brazzale et al., 2007, p. 139).
Barndorff-Nielsen (1986) and Barndorff-Nielsen (1991) suggested the statistic

$$
r^*(\psi_0) = r(\psi_0) + \frac{1}{r(\psi_0)} \log \frac{q(\psi_0)}{r(\psi_0)},
\tag{12}
$$

referred to as the modified signed likelihood ratio (MSLR) statistic, where expressions of
$q(\psi_0)$ are provided in Barndorff-Nielsen (1986) and Barndorff-Nielsen (1991); these sources also
proved that the statistic has a standard normal null distribution asymptotically. Further,
several researchers such as Barndorff-Nielsen and Cox (1994, p. 203), Brazzale et al. (2007,
p. 11), Jensen (1997), and Reid (2003, p. 1722) showed that for probability distributions
that belong to the exponential family of distributions, $q(\psi_0)$ can be computed as

$$
q(\psi_0) = (\hat{\psi} - \psi_0) \sqrt{\frac{|j(\hat{\psi}, \hat{\lambda})|}{j_{\lambda\lambda}(\psi_0, \hat{\lambda}_{\psi_0})}}.
\tag{13}
$$

The statistic $q(\psi_0)$ is a version of the Wald statistic (e.g., Brazzale et al., 2007, p. 12).
Therefore, $r^*(\psi_0)$ can be considered to be an adjusted version of the SLR statistic $r(\psi_0)$,
where the extent of adjustment depends on the relative magnitude of a version of the Wald
statistic and the SLR statistic.


Lugannani-Rice Approximation 
An alternative approach to the use of the MSLRT is the use of the Lugannani-Rice
approximation (LRA; Lugannani & Rice, 1980) that expresses the probability of the SLR
statistic being smaller than $r(\psi_0)$ under the null hypothesis as

$$
\Phi^*(r(\psi_0)) = \Phi(r(\psi_0)) + \left[ \frac{1}{r(\psi_0)} - \frac{1}{q(\psi_0)} \right] \phi(r(\psi_0)),
\tag{14}
$$

where $\phi$ denotes the standard normal density function and $\Phi$ denotes the standard normal
cumulative distribution function. Thus, to test $H_0: \psi = \psi_0$ against $H_1: \psi > \psi_0$, one using
the LRA obtains the p-value as $(1 - \Phi^*(r(\psi_0)))$ whereas one using the MSLRT obtains the
p-value as $(1 - \Phi(r^*(\psi_0)))$. Asymptotically, these two p-values are equivalent for probability
distributions that belong to the exponential family of distributions (e.g., Brazzale et al.,
2007; Jensen, 1992). The LRA and MSLRT approaches are often referred to as the p* and
the r* approaches, respectively, in the literature on higher-order asymptotics.
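Once $r(\psi_0)$ and $q(\psi_0)$ are available, Equations 12 and 14 reduce to a few lines of arithmetic. The Python sketch below uses only the standard library; the function names are hypothetical, and the sketch does not handle the case in which $r(\psi_0)$ is close to zero or in which $r(\psi_0)$ and $q(\psi_0)$ differ in sign.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mslr_statistic(r, q):
    """Equation 12: r* = r + (1 / r) * log(q / r).

    Requires r and q to be nonzero and of the same sign.
    """
    return r + math.log(q / r) / r


def lra_cdf(r, q):
    """Equation 14: Phi*(r) = Phi(r) + (1 / r - 1 / q) * phi(r)."""
    return norm_cdf(r) + (1.0 / r - 1.0 / q) * norm_pdf(r)


def one_sided_p_values(r, q):
    """One-sided p-values from the MSLRT and from the LRA, in that order."""
    return 1.0 - norm_cdf(mslr_statistic(r, q)), 1.0 - lra_cdf(r, q)
```

When $q(\psi_0) = r(\psi_0)$, both adjustments vanish and the two p-values coincide with the one from the unmodified SLR statistic.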


The Advantages of the MSLRT and the Lugannani-Rice Approximation 


The MSLRT and the LRA are covered under the umbrella of methods that are referred
to as higher-order asymptotics (e.g., Barndorff-Nielsen & Cox, 1994; Ghosh, 1994) and
have been found to lead to more accurate asymptotic approximations than the Wald test,
the LRT, and the score test that are based on first-order asymptotic theory (e.g., Pierce &
Peters, 1992; Brazzale et al., 2007, p. 1).

While the asymptotic distribution of both $r^*(\psi_0)$ and $r(\psi_0)$ is standard normal
under the null hypothesis, researchers such as Barndorff-Nielsen (1991) showed that
for continuous response variables, the asymptotic result for $r^*(\psi_0)$ holds with a higher
degree of approximation compared to $r(\psi_0)$. Specifically, when the response variable is
continuous, the relative error from the use of $r^*(\psi_0)$ is typically $O(n^{-3/2})$, which means, for
example, that if $\hat{p}$ denotes an estimated p-value computed using $r^*(\psi_0)$ and $p$ denotes the
corresponding true p-value, then

$$
\left| \frac{\hat{p} - p}{p} \right| < \frac{M_1}{n^{3/2}} \text{ as } n \to \infty
$$

(Barndorff-Nielsen, 1991), where $n$ is the sample size and $M_1$ is a finite number. In contrast,
the relative error from the use of $r(\psi_0)$ is typically $O(n^{-1/2})$, which means that if $\hat{p}'$ denotes
an estimated p-value computed using $r(\psi_0)$, then

$$
\left| \frac{\hat{p}' - p}{p} \right| < \frac{M_2}{n^{1/2}} \text{ as } n \to \infty
$$

(Barndorff-Nielsen, 1991), where $M_2$ is a finite number. Because $M_1 / n^{3/2}$ would typically be
considerably smaller than $M_2 / n^{1/2}$ and converges to 0 much faster than $M_2 / n^{1/2}$ as $n \to \infty$, the
p-value from the use of $r^*(\psi_0)$ is expected to be more accurate (that is, be close to the true
p-value) compared to that from $r(\psi_0)$ for continuous response variables. Because the use of
the LRA (Lugannani & Rice, 1980) is asymptotically equivalent to the use of $r^*(\psi_0)$ for the
exponential family of distributions, the LRA has the same advantages as $r^*(\psi_0)$ over $r(\psi_0)$.
To test the null hypothesis $H_0: \psi = \psi_0$, it is also possible to employ the Wald statistic
or the SS statistic (e.g., Brazzale et al., 2007, p. 139), whose expressions for the 2PLM are
provided in Equations 2 and 7, respectively. The relative error of these two statistics is the
same as that of $r(\psi_0)$ and is $O(n^{-1/2})$. Therefore, if the response variable is continuous, the
p-values originating from $r^*(\psi_0)$ and LRA are expected to be more accurate compared to
those originating from the Wald statistic and the SS statistic as well.
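To give a rough sense of scale for the two rates, the arithmetic below treats the two constants as comparable, which is purely an illustrative assumption; the constants are unknown in practice and the bounds apply to continuous responses only. For a sample size of $n = 20$,

$$
n^{-1/2} = 20^{-1/2} \approx 0.224,
\qquad
n^{-3/2} = 20^{-3/2} \approx 0.011.
$$

Under that assumption, the higher-order bound is smaller than the first-order bound by a factor of $n = 20$.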
The response variables in applications of IRT models are the item scores, which are
discrete. No general result on the relative error for $r^*(\psi_0)$ or the LRA is available for such
variables. However, if the probability distribution of a discrete response variable is a special
case of the exponential family of distributions, it is possible to compute the MSLR statistic
and the LRA using Equations 12 and 14. Further, researchers such as Brazzale et al. (2007,
Chapter 4) have found methods based on higher-order asymptotics to provide satisfactory
results for discrete response variables. Therefore, even though the MSLR statistic and the
LRA may not have relative error of $O(n^{-3/2})$, they may still lead to better results compared
to the Wald test, SLR test and score test for IRT models. The following section includes
descriptions of the computation of the MSLR statistic and the LRA for IRT models.


Methods: MSLRT and LRA for IRT Models 


As mentioned above, the test for the equality of two ability parameters, that is, the
test of $H_0: \theta_1 = \theta_2$ for IRT models can be placed in the framework discussed above by
using the transformations $\psi = \theta_2 - \theta_1$ and $\lambda = \theta_1$ and setting $\psi_0$ equal to 0. These
transformations imply that

$$
\theta_2 = \psi + \lambda \quad \text{and} \quad \theta_1 = \lambda.
\tag{15}
$$


MSLRT and LRA to Test for the Equality of Two Abilities for the 2PLM 


Equations 1 and 15 imply that the log-likelihood of $\psi$ and $\lambda$, $\ell(\psi, \lambda)$, is equal to

$$
\begin{aligned}
\ell(\psi, \lambda)
&= S_1 \lambda - \sum_i X_i a_i b_i + \sum_i \log(1 - p_i(\lambda))
 + S_2 (\psi + \lambda) - \sum_j Y_j a_j b_j + \sum_j \log(1 - p_j(\psi + \lambda)) \\
&= S_2 \psi + (S_1 + S_2) \lambda + \sum_i \log(1 - p_i(\lambda)) + \sum_j \log(1 - p_j(\psi + \lambda))
 - \sum_i X_i a_i b_i - \sum_j Y_j a_j b_j
\end{aligned}
\tag{16}
$$

for the 2PLM. The log-likelihood is a special case of the log-likelihood of the exponential
family of distributions with canonical parameters $\psi$ and $\lambda$ and sufficient statistics $S_2$
and $(S_1 + S_2)$ because the first two terms above depend only on $S_2$, $(S_1 + S_2)$, and the
parameters, Terms 3-4 depend only on the parameters, and Terms 5-6 depend only on the
data.² Therefore, the earlier discussion on the application of the MSLRT and the LRA to
the exponential family of distributions implies that it is possible to apply the MSLRT and
the LRA to test the hypothesis $H_0: \theta_1 = \theta_2$ and to obtain an expression of $q(\psi_0)$ using
Equation 13 for the 2PLM.


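The first derivatives of $\ell(\psi, \lambda)$ are given by

$$
\begin{aligned}
\frac{\partial \ell(\psi, \lambda)}{\partial \psi}
&= S_2 - \sum_j a_j \frac{\exp[a_j(\psi + \lambda - b_j)]}{1 + \exp[a_j(\psi + \lambda - b_j)]}, \\
\frac{\partial \ell(\psi, \lambda)}{\partial \lambda}
&= S_1 + S_2 - \sum_i a_i \frac{\exp[a_i(\lambda - b_i)]}{1 + \exp[a_i(\lambda - b_i)]}
 - \sum_j a_j \frac{\exp[a_j(\psi + \lambda - b_j)]}{1 + \exp[a_j(\psi + \lambda - b_j)]}.
\end{aligned}
$$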


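Then, the matrix $j(\psi, \lambda)$ of the negative of the second derivatives of $\ell(\psi, \lambda)$ can be
computed as

$$
j(\psi, \lambda) =
\begin{pmatrix}
\sum_j a_j^2 \dfrac{\exp[a_j(\psi + \lambda - b_j)]}{(1 + \exp[a_j(\psi + \lambda - b_j)])^2}
& \sum_j a_j^2 \dfrac{\exp[a_j(\psi + \lambda - b_j)]}{(1 + \exp[a_j(\psi + \lambda - b_j)])^2} \\
\sum_j a_j^2 \dfrac{\exp[a_j(\psi + \lambda - b_j)]}{(1 + \exp[a_j(\psi + \lambda - b_j)])^2}
& \sum_i a_i^2 \dfrac{\exp[a_i(\lambda - b_i)]}{(1 + \exp[a_i(\lambda - b_i)])^2}
 + \sum_j a_j^2 \dfrac{\exp[a_j(\psi + \lambda - b_j)]}{(1 + \exp[a_j(\psi + \lambda - b_j)])^2}
\end{pmatrix}.
\tag{17}
$$

Noting that the off-diagonal elements of $j(\psi, \lambda)$ are the same as its first diagonal element, the
determinant of $j(\psi, \lambda)$ is obtained as

$$
\begin{aligned}
|j(\psi, \lambda)|
&= \sum_i a_i^2 \frac{\exp[a_i(\lambda - b_i)]}{(1 + \exp[a_i(\lambda - b_i)])^2}
 \times \sum_j a_j^2 \frac{\exp[a_j(\psi + \lambda - b_j)]}{(1 + \exp[a_j(\psi + \lambda - b_j)])^2} \\
&= \sum_i a_i^2 \frac{\exp[a_i(\theta_1 - b_i)]}{(1 + \exp[a_i(\theta_1 - b_i)])^2}
 \times \sum_j a_j^2 \frac{\exp[a_j(\theta_2 - b_j)]}{(1 + \exp[a_j(\theta_2 - b_j)])^2}.
\end{aligned}
\tag{18}
$$
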
To implement the MSLRT and LRA for the 2PLM, one has to compute the MLEs of
$\theta_1$ and $\theta_2$, denoted $\hat{\theta}_1$ and $\hat{\theta}_2$, respectively, and set the MLEs of $\psi$ and $\lambda$ as $\hat{\psi} = \hat{\theta}_2 - \hat{\theta}_1$
and $\hat{\lambda} = \hat{\theta}_1$, respectively. One also has to compute, under the restriction $\psi = \psi_0$, which
is equivalent to the restriction $\theta_1 = \theta_2 = \theta_0$, the MLE of the common ability parameter
$\theta_0$; let us denote this MLE as $\hat{\theta}_0$. Note that $\hat{\theta}_0$ is computed from all the item scores ($X_i$'s
and $Y_j$'s) for an examinee. The estimate $\hat{\theta}_0$ can also be denoted as $\hat{\lambda}_{\psi_0}$ according to the
notation introduced earlier.


²Note that the item parameters are assumed known throughout this paper.


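Then, one obtains $|j(\hat{\psi}, \hat{\lambda})|$ by replacing $\theta_1$ and $\theta_2$ by $\hat{\theta}_1$ and $\hat{\theta}_2$, respectively, in
Equation 18, and obtains $j_{\lambda\lambda}(\psi_0, \hat{\lambda}_{\psi_0})$ from Equations 8 and 17 as

$$
\begin{aligned}
j_{\lambda\lambda}(\psi_0, \hat{\lambda}_{\psi_0})
&= \text{the second diagonal element of } j(\psi, \lambda) \text{ computed at } \psi = \psi_0, \lambda = \hat{\lambda}_{\psi_0} \\
&= \sum_i a_i^2 \frac{\exp[a_i(\hat{\lambda}_{\psi_0} - b_i)]}{(1 + \exp[a_i(\hat{\lambda}_{\psi_0} - b_i)])^2}
 + \sum_j a_j^2 \frac{\exp[a_j(\hat{\lambda}_{\psi_0} - b_j)]}{(1 + \exp[a_j(\hat{\lambda}_{\psi_0} - b_j)])^2} \\
&= \sum_i a_i^2 \frac{\exp[a_i(\hat{\theta}_0 - b_i)]}{(1 + \exp[a_i(\hat{\theta}_0 - b_i)])^2}
 + \sum_j a_j^2 \frac{\exp[a_j(\hat{\theta}_0 - b_j)]}{(1 + \exp[a_j(\hat{\theta}_0 - b_j)])^2}.
\end{aligned}
$$

Then, from Equation 13, $q(\psi_0)$ is given by

$$
q(\psi_0) = (\hat{\theta}_2 - \hat{\theta}_1)
\sqrt{
\frac{
\sum_i a_i^2 \dfrac{\exp[a_i(\hat{\theta}_1 - b_i)]}{(1 + \exp[a_i(\hat{\theta}_1 - b_i)])^2}
\; \sum_j a_j^2 \dfrac{\exp[a_j(\hat{\theta}_2 - b_j)]}{(1 + \exp[a_j(\hat{\theta}_2 - b_j)])^2}
}{
\sum_i a_i^2 \dfrac{\exp[a_i(\hat{\theta}_0 - b_i)]}{(1 + \exp[a_i(\hat{\theta}_0 - b_i)])^2}
 + \sum_j a_j^2 \dfrac{\exp[a_j(\hat{\theta}_0 - b_j)]}{(1 + \exp[a_j(\hat{\theta}_0 - b_j)])^2}
}
},
\tag{19}
$$

which looks very similar to the Wald statistic given in Equation 2.³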


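Further, Equations 9-11 imply that $r(\psi_0)$, the SLR statistic, is obtained as

$$
\begin{aligned}
r(\psi_0)
&= \operatorname{sign}(\hat{\psi} - \psi_0) \left[ 2 \left\{ \ell(\hat{\psi}, \hat{\lambda}) - \ell(\psi_0, \hat{\lambda}_{\psi_0}) \right\} \right]^{1/2} \\
&= \operatorname{sign}(\hat{\psi} - \psi_0) \sqrt{2} \Big[ S_2 (\hat{\psi} - \psi_0) + (S_1 + S_2)(\hat{\lambda} - \hat{\lambda}_{\psi_0})
 + \sum_i \log(1 - p_i(\hat{\lambda})) \\
&\qquad + \sum_j \log(1 - p_j(\hat{\psi} + \hat{\lambda}))
 - \sum_i \log(1 - p_i(\hat{\lambda}_{\psi_0})) - \sum_j \log(1 - p_j(\psi_0 + \hat{\lambda}_{\psi_0})) \Big]^{1/2} \\
&= \operatorname{sign}(\hat{\theta}_2 - \hat{\theta}_1) \sqrt{2} \Big[ S_1 (\hat{\theta}_1 - \hat{\theta}_0) + S_2 (\hat{\theta}_2 - \hat{\theta}_0)
 - \sum_i \log(1 + \exp[a_i(\hat{\theta}_1 - b_i)]) \\
&\qquad - \sum_j \log(1 + \exp[a_j(\hat{\theta}_2 - b_j)])
 + \sum_i \log(1 + \exp[a_i(\hat{\theta}_0 - b_i)])
 + \sum_j \log(1 + \exp[a_j(\hat{\theta}_0 - b_j)]) \Big]^{1/2}.
\end{aligned}
\tag{20}
$$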


³If both $\hat{\theta}_1$ and $\hat{\theta}_2$ under the square root sign in the expression of $q(\psi_0)$ are replaced by $\hat{\theta}_0$, $q(\psi_0)$ would
become identical to the Wald statistic.

Once $q(\psi_0)$ is computed using Equation 19 and $r(\psi_0)$ is computed using Equation 20,
one can compute the MSLR statistic $r^*(\psi_0)$, which is a function of $q(\psi_0)$ and $r(\psi_0)$, using
Equation 12, and can compute the LRA using Equation 14.

If $q(\psi_0)$ and $r(\psi_0)$ are both close to zero, $r^*(\psi_0)$ involves the logarithm of the ratio
of two very small numbers and hence becomes unstable (e.g., Jensen, 1995, p. 136). An
approximation of $r^*(\psi_0)$ in such a case was obtained by performing calculations similar to
those in Example 5.3.4 of Jensen (1995, p. 136) and was employed whenever $|r(\psi_0)| < 0.05$.
The approximation involves tedious algebra and is not described here; it can be obtained
upon request from the authors.
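The computation of $r(\psi_0)$ and $q(\psi_0)$ for the 2PLM can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the authors' implementation: the MLEs are obtained by bisection restricted to [-4, 4], the function names are hypothetical, and the special approximation used when $|r(\psi_0)| < 0.05$ is not included.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct answer under the 2PLM."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def mle_theta(scores, items, lo=-4.0, hi=4.0, tol=1e-10):
    """MLE of ability by bisection on the (decreasing) score function."""
    def score(theta):
        return sum(a * (u - p_2pl(theta, a, b)) for u, (a, b) in zip(scores, items))

    if score(lo) <= 0.0:
        return lo
    if score(hi) >= 0.0:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def information(theta, items):
    """Sum over items of a^2 exp[a(theta - b)] / (1 + exp[a(theta - b)])^2."""
    return sum(a * a * p_2pl(theta, a, b) * (1.0 - p_2pl(theta, a, b)) for a, b in items)


def sum_log1pexp(theta, items):
    """Sum over items of log(1 + exp[a(theta - b)])."""
    return sum(math.log1p(math.exp(a * (theta - b))) for a, b in items)


def r_and_q(x, items1, y, items2):
    """Return (r, q, theta1_hat, theta2_hat, theta0_hat) for the 2PLM."""
    t1 = mle_theta(x, items1)
    t2 = mle_theta(y, items2)
    t0 = mle_theta(list(x) + list(y), list(items1) + list(items2))
    s1 = sum(a * u for u, (a, _) in zip(x, items1))
    s2 = sum(a * u for u, (a, _) in zip(y, items2))
    # Equation 19: Wald-type quantity q.
    q = (t2 - t1) * math.sqrt(
        information(t1, items1) * information(t2, items2)
        / (information(t0, items1) + information(t0, items2))
    )
    # Equation 20: signed likelihood root r (last form of the equation).
    inner = (
        s1 * (t1 - t0) + s2 * (t2 - t0)
        - sum_log1pexp(t1, items1) - sum_log1pexp(t2, items2)
        + sum_log1pexp(t0, items1) + sum_log1pexp(t0, items2)
    )
    # max(..., 0) guards against a tiny negative value caused by rounding.
    r = math.copysign(math.sqrt(2.0 * max(inner, 0.0)), t2 - t1)
    return r, q, t1, t2, t0
```

The returned `r` and `q` can then be passed to Equations 12 and 14. Returning the three ability estimates as well makes it easy to check the result against a direct evaluation of the log-likelihood.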


An Example 


Let us consider the application of the MSLRT and LRA to detect item preknowledge
for two examinees belonging to a data set that is described later in the Real Data
section.


Table 1. The Values of Several Quantities for Two Examinees.

Examinee   $\hat{\theta}_1$   $\hat{\theta}_2$   $\hat{\theta}_0$   $r(\psi_0)$   $q(\psi_0)$   $r^*(\psi_0)$   $\Phi(r^*(\psi_0))$   $\Phi^*(r(\psi_0))$
1          -1.80              -0.20              -1.43              3.09          2.77          3.05            0.9989                0.9989
2          -1.44              -0.83              -1.28              1.22          1.16          1.18            0.8810                0.8810


Table 1 lists the values of $\hat{\theta}_1$, $\hat{\theta}_2$, $\hat{\theta}_0$, $r(\psi_0)$, $q(\psi_0)$, $r^*(\psi_0)$, $\Phi(r^*(\psi_0))$, and $\Phi^*(r(\psi_0))$
for the two examinees. The null hypothesis corresponds to no cheating (or no item
preknowledge) while the alternative hypothesis corresponds to potential cheating. The
p-values obtained from the MSLR statistic and the LRA were identical up to 4 decimal
places for both the examinees. For the first examinee, $\hat{\theta}_2$ is considerably larger than $\hat{\theta}_1$;
naturally, $H_0$ is rejected at level 0.01 by all of SLR, MSLRT, and LRA. For the second
examinee, $\hat{\theta}_2$ is not much larger than $\hat{\theta}_1$ and $H_0$ is not rejected at level 0.05 by any of the
statistics.
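As a check on how the entries of Table 1 fit together, the values for the first examinee can be reproduced approximately from the rounded $r(\psi_0) = 3.09$ and $q(\psi_0) = 2.77$ using Equations 12 and 14. The arithmetic below is a sketch based on those rounded inputs, so the last digit may differ slightly from a computation with unrounded values.

$$
\begin{aligned}
r^*(\psi_0) &= 3.09 + \frac{1}{3.09} \log \frac{2.77}{3.09} \approx 3.09 - 0.035 \approx 3.05, \\
\Phi^*(r(\psi_0)) &= \Phi(3.09) + \left[ \frac{1}{3.09} - \frac{1}{2.77} \right] \phi(3.09)
 \approx 0.99900 - 0.0374 \times 0.00337 \approx 0.9989.
\end{aligned}
$$

Both routes give a one-sided p-value of about 0.001, in line with the rejection of the null hypothesis at level 0.01.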

As in this example, the MSLRT and LRA led to very similar results in the analysis of
simulated data and real data in this paper. Therefore, only the results of the LRA among
these two statistics are discussed henceforth.
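MSLRT and LRA to Test for the Equality of Two Abilities for the GPCM


Let us consider the GPCM (Muraki, 1992) for which the log-likelihood of the examinee
ability on a test with $n$ polytomous items is given by

$$
\ell(\theta) = \sum_{i=1}^{n} \sum_{k=0}^{m_i} d_k(X_i) \log P_{ik}(\theta),
\tag{21}
$$

where $X_i$, the examinee's score on item $i$, is an integer between 0 and $m_i$,

$$
d_k(X_i) =
\begin{cases}
1 & \text{if } X_i = k, \\
0 & \text{otherwise,}
\end{cases}
$$

and

$$
P_{ik}(\theta) = P(d_k(X_i) = 1) = E(d_k(X_i))
$$

is the probability that the examinee's score on item $i$ is equal to $k$ and is given by

$$
P_{ik}(\theta) = \frac{\exp\left[\sum_{h=0}^{k} a_i(\theta - b_{ih})\right]}{\sum_{c=0}^{m_i} \exp\left[\sum_{h=0}^{c} a_i(\theta - b_{ih})\right]}
= \frac{\exp\left[\sum_{h=0}^{k} a_i(\theta - b_{ih})\right]}{\Gamma_i(\theta)},
\tag{22}
$$

where $a_i$'s and $b_{ih}$'s are the slope and location parameters, respectively, and

$$
\Gamma_i(\theta) = \sum_{c=0}^{m_i} \exp\left[\sum_{h=0}^{c} a_i(\theta - b_{ih})\right].
$$

Let us denote an examinee's scores on two sets of polytomous items as $X_i, i = 1, 2, \ldots, n_1$,
and $Y_j, j = 1, 2, \ldots, n_2$, and the underlying true abilities as $\theta_1$ and $\theta_2$, respectively. Let us
further assume that the possible scores on item $i$ range between $k = 0, 1, \ldots, m_i$, and those
on item $j$ range between $k = 0, 1, \ldots, m_j$. Let us also assume that the probabilities given by
Equation 22 are denoted by $P_{ik}(\theta_1)$ for the first set of items and by $P_{jk}(\theta_2)$ for the second
set of items. It is proved in the appendix that for the GPCM, one can compute the MSLR
statistic $r^*(\psi_0)$ using Equation 12 and the LRA using Equation 14, where $q(\psi_0)$ is given by

$$
q(\psi_0) = (\hat{\theta}_2 - \hat{\theta}_1)
\sqrt{
\frac{
\sum_{i=1}^{n_1} a_i^2 \operatorname{Var}(X_i | \hat{\theta}_1) \; \sum_{j=1}^{n_2} a_j^2 \operatorname{Var}(Y_j | \hat{\theta}_2)
}{
\sum_{i=1}^{n_1} a_i^2 \operatorname{Var}(X_i | \hat{\theta}_0) + \sum_{j=1}^{n_2} a_j^2 \operatorname{Var}(Y_j | \hat{\theta}_0)
}
},
\tag{23}
$$

where, for example,

$$
\operatorname{Var}(X_i | \hat{\theta}_1) = \sum_{k=0}^{m_i} k^2 P_{ik}(\hat{\theta}_1) - \left( \sum_{k=0}^{m_i} k P_{ik}(\hat{\theta}_1) \right)^2,
$$

and $r(\psi_0)$ is given by

$$
\begin{aligned}
r(\psi_0) = \operatorname{sign}(\hat{\theta}_2 - \hat{\theta}_1) \sqrt{2} \Bigg[
& \sum_{i=1}^{n_1} \sum_{k=0}^{m_i} d_k(X_i) \log P_{ik}(\hat{\theta}_1)
 + \sum_{j=1}^{n_2} \sum_{k=0}^{m_j} d_k(Y_j) \log P_{jk}(\hat{\theta}_2) \\
& - \sum_{i=1}^{n_1} \sum_{k=0}^{m_i} d_k(X_i) \log P_{ik}(\hat{\theta}_0)
 - \sum_{j=1}^{n_2} \sum_{k=0}^{m_j} d_k(Y_j) \log P_{jk}(\hat{\theta}_0) \Bigg]^{1/2}.
\end{aligned}
\tag{24}
$$

If $X_i$'s and $Y_j$'s are dichotomous (that is, $m_i = m_j = 1$), then, for example,
$\operatorname{Var}(X_i | \hat{\theta}_1) = P_{i1}(\hat{\theta}_1)(1 - P_{i1}(\hat{\theta}_1)) = \frac{\exp[a_i(\hat{\theta}_1 - b_i)]}{(1 + \exp[a_i(\hat{\theta}_1 - b_i)])^2}$, and Equation 23 becomes equal to
Equation 19, the corresponding expression for the 2PLM, and Equation 24 becomes equal
to Equation 20. Thus, the expression of the MSLR statistic for the 2PLM is a special case
of that for the GPCM.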
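The GPCM quantities that enter Equation 23 can be sketched in Python as follows. The sketch assumes that the location parameters of an item are supplied as a list starting with the parameter for category 0 (whose value cancels from the probabilities), and the function names are hypothetical.

```python
import math


def gpcm_probs(theta, a, b):
    """Category probabilities P_i0, ..., P_im for one GPCM item (Equation 22).

    b is the list of location parameters (b_i0, ..., b_im).
    """
    cumulative = []
    total = 0.0
    for b_h in b:
        total += a * (theta - b_h)
        cumulative.append(total)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cumulative)
    weights = [math.exp(v - top) for v in cumulative]
    norm = sum(weights)
    return [w / norm for w in weights]


def gpcm_var(theta, a, b):
    """Variance of the item score given theta: E(X^2) - E(X)^2."""
    probs = gpcm_probs(theta, a, b)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum(k * k * p for k, p in enumerate(probs)) - mean * mean


def gpcm_information(theta, items):
    """Sum of a^2 Var(X | theta) over items; items is a list of (a, b) pairs."""
    return sum(a * a * gpcm_var(theta, a, b) for a, b in items)
```

For an item with two categories, `gpcm_probs` returns the 2PLM probability of a correct answer as its second element, which mirrors the reduction of Equation 23 to Equation 19 noted above.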


Simulation Studies 
Simulation 1: Measurement of Change Using Dichotomous Items 


A simulation somewhat similar to one in Finkelman et al. (2010) was performed to 
compare the performances of the statistics in the context of measurement of change. It was 
assumed that the null hypothesis is that the ability 6; at time point 1 is equal to that (2) 
at time point 2, that is, Hp : 6; = 02, and the alternative hypothesis is H, : 02 > 04. 

Two non-adaptive assessments, each with 10, 20, 30, or 50 dichotomous items, were 
used as the assessments administered at time points 1 and 2. As in Finkelman et al. (2010), 
the true value of either of $\theta_1$ or $\theta_2$ was considered to be equal to one among -2, -1.5, ..., 1.5,
2. Item-scores of 100,000 examinees were simulated for each possible combination of true
$\theta_1$ and true $\theta_2$ where $\theta_1 \le \theta_2 \le \min(\theta_1 + 1.5, 2.0)$. Thus, for example, when true $\theta_1$ is 0.5,
true $\theta_2$ can take only one of the four values 0.5, 1, 1.5, and 2.0; this strategy limits the
maximum change in ability to 1.5. The nine combinations with true $\theta_1$ = true $\theta_2$, that is,
the combinations (-2,-2), (-1.5,-1.5), ..., (2,2), represent the "no change" condition and
were used to study the Type I error rates of the statistics. The remaining 21 conditions
with true $\theta_2$ > true $\theta_1$ represent the "positive change" condition and were used to study the
power of the statistics.
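
The counts of nine "no change" and 21 "positive change" conditions can be checked by enumerating the grid directly. The sketch below is only an illustration of the design described above; the function name and its arguments are hypothetical.

```python
def design_conditions(low=-2.0, high=2.0, step=0.5, max_change=1.5):
    """Enumerate (theta1, theta2) pairs with theta1 <= theta2 <= min(theta1 + max_change, high)."""
    n_points = int(round((high - low) / step)) + 1
    grid = [low + step * i for i in range(n_points)]
    no_change, positive_change = [], []
    for theta1 in grid:
        for theta2 in grid:
            if theta1 <= theta2 <= min(theta1 + max_change, high):
                if theta1 == theta2:
                    no_change.append((theta1, theta2))
                else:
                    positive_change.append((theta1, theta2))
    return no_change, positive_change

no_change, positive_change = design_conditions()
print(len(no_change), len(positive_change))
```

Multiples of 0.5 are represented exactly in binary floating point, so the equality comparison in this particular grid is safe; a different step size would call for a tolerance.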
The 2PLM was used in the analysis. The sets of true item parameters for the two
assessments were non-overlapping and were randomly drawn from a set of estimated item
parameters from a language test that employs the 2PLM operationally. The MLE of
ability, restricted to the range between -4.0 and 4.0, was used in the computations.⁴ For
each simulated examinee, item scores were simulated on the two assessments; the MLE of
the ability was computed separately on the first assessment, the second assessment, and the
combined assessment; and then the Wald, LRT, SLR, and SS statistics and the LRA were
computed using Equations 2, 4, 5, 7, and 14, respectively.
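
To make the per-examinee computation concrete, here is a minimal Python sketch for dichotomous items under the 2PLM. It is not the authors' code. It assumes that r is the signed square root of the likelihood ratio statistic, that q has the form of Equation 23 specialized to dichotomous items, and that the modified statistic and the Lugannani-Rice tail probability take their standard textbook forms, r* = r + log(q/r)/r and 1 − Φ(r) + φ(r)(1/q − 1/r); whether these match Equations 12 and 14 exactly is an assumption. The bisection routine, the function names, and the item parameters in the usage note are likewise hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def prob_correct(theta, a, b):
    """2PLM probability of a correct response for slope a and location b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_lik(theta, items, scores):
    """Log-likelihood of a list of 0/1 scores under the 2PLM."""
    total = 0.0
    for (a, b), x in zip(items, scores):
        p = prob_correct(theta, a, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def info(theta, items):
    """Sum over items of a^2 * p * (1 - p)."""
    return sum(a * a * prob_correct(theta, a, b) * (1.0 - prob_correct(theta, a, b))
               for a, b in items)

def restricted_mle(items, scores, lo=-4.0, hi=4.0):
    """MLE of ability on [lo, hi], found by bisection on the decreasing score function."""
    def score(theta):
        return sum(a * (x - prob_correct(theta, a, b)) for (a, b), x in zip(items, scores))
    if score(lo) <= 0.0:      # likelihood decreasing on the whole range
        return lo
    if score(hi) >= 0.0:      # likelihood increasing on the whole range
        return hi
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def higher_order_test(items1, x, items2, y):
    """r, q, r*, and upper-tail p-values for H0: theta1 = theta2 against theta2 > theta1.

    items1, items2: lists of (a, b) tuples; x, y: lists of 0/1 scores.
    """
    t1 = restricted_mle(items1, x)
    t2 = restricted_mle(items2, y)
    t0 = restricted_mle(items1 + items2, x + y)   # common ability under H0
    lr = 2.0 * (log_lik(t1, items1, x) + log_lik(t2, items2, y)
                - log_lik(t0, items1 + items2, x + y))
    r = math.copysign(math.sqrt(max(lr, 0.0)), t2 - t1)
    q = (t2 - t1) * math.sqrt(info(t1, items1) * info(t2, items2)
                              / info(t0, items1 + items2))
    p_r = 1.0 - STD_NORMAL.cdf(r)
    if abs(r) < 1e-8 or q * r <= 0.0:
        # The corrections are undefined near r = 0; fall back to the first-order result.
        return {"r": r, "q": q, "r_star": r, "p_r_star": p_r, "p_lra": p_r}
    r_star = r + math.log(q / r) / r
    p_lra = p_r + STD_NORMAL.pdf(r) * (1.0 / q - 1.0 / r)
    return {"r": r, "q": q, "r_star": r_star,
            "p_r_star": 1.0 - STD_NORMAL.cdf(r_star), "p_lra": p_lra}
```

For example, with ten identical hypothetical items (a = 1, b = 0) in each set, three correct answers on the first set, and eight on the second, `higher_order_test` returns r of about 2.30, q of about 1.84, and r* of about 2.20, and the two higher-order p-values agree closely with each other (both near 0.014).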


[Figure 1: four panels titled Wald, SS, SLR, and Lugannani-Rice; X-axis: Theoretical Quantiles; Y-axis: Sample Quantiles; both axes run from 2.0 to 4.0.]


Figure 1. Normal Quantile Plots for the Four Statistics. 


⁴ The results using the weighted maximum likelihood estimator (WLE; Warm, 1989) of ability were very
similar.
Figure 1 shows the normal quantile plots, created using the function qqnorm in the R
software (R Core Team, 2017),⁵ for the Wald statistic, the SS statistic, the SLR statistic,
and the LRA for all the examinees under the "no change" condition for test length of 30.
The theoretical quantiles of the standard normal distribution are shown along the X-axis
and the sample quantiles are shown along the Y-axis. In each panel, a diagonal line is also
provided. The closeness of each curve to the diagonal line is a measure of the closeness of
the null distribution of the corresponding statistic to the standard normal approximation.
Only the region where the quantiles are between 2 and 4 is shown in each panel.⁶
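
The plots in the paper were produced with qqnorm in R. The Python sketch below only illustrates the idea: it pairs ordered sample values with standard normal quantiles and keeps the upper-tail region. The plotting positions (i − 0.5)/n, the filtering on the theoretical quantile, and the reading of the LRA as an upper-tail p-value are assumptions of this sketch, and the function names are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(values, lower=2.0, upper=4.0):
    """Return (theoretical, sample) quantile pairs whose theoretical quantile is in [lower, upper]."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    points = []
    for i, sample_q in enumerate(ordered, start=1):
        theoretical_q = std_normal.inv_cdf((i - 0.5) / n)
        if lower <= theoretical_q <= upper:
            points.append((theoretical_q, sample_q))
    return points

def lra_to_normal_quantile(p_upper):
    """Map an upper-tail probability strictly between 0 and 1 to the standard normal scale."""
    return NormalDist().inv_cdf(1.0 - p_upper)
```

Points lying above the diagonal in such a plot correspond to sample quantiles that exceed the normal quantiles, which is the pattern associated with an inflated Type I error rate.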

Figure 1 shows that for the Wald statistic, the sample quantiles are larger than the 
theoretical quantiles—this phenomenon suggests that the Wald statistic would have an 
inflated Type I error rate under a standard normal null distribution assumption. The 
curves for the SS and SLR statistics are comparatively closer to the diagonal line except for 
values larger than about 2.5 where sample quantiles are slightly smaller than the theoretical 
quantiles for the SS statistic and larger for the SLR statistic. The curve for the LRA is 
closest, among the four statistics, to the diagonal line for large values; this result implies 
that under the standard normal null distribution assumption, the Type I error rate for the
LRA is

• closest to the nominal level among these four statistics,
• slightly more satisfactory than that of the SLR and SS statistics, and
• considerably more satisfactory than that of the Wald statistic, especially at small levels
of significance.


In addition, the sample quantiles for the LRA are slightly smaller than or equal to the
theoretical quantiles throughout, which implies that the Type I error rate for the statistic
would not typically exceed the nominal level even for very small levels.

⁵ To create the plot for the LRA, which lies between 0 and 1, the standard normal quantile of the LRA was
used as the input; that is because the use of the LRA provided by Equation 14 to test $H_0$ is equivalent to the
use of the standard normal quantile of the LRA as a statistic along with a standard normal null distribution
assumption.

⁶ For quantiles between -4 and 2, the curves for the SS and SLR statistics and the LRA were very close
to the diagonal line.

Table 2. Summaries of the Distributions of the Four Statistics Under the Null Hypothesis
for Test Length of 30.

                        Moments                            Percentiles
Statistic    Mean    SD    Skewness  Kurtosis     25     50     75     95     99
N(0, 1)      0.00   1.00     0.00      0.00     -0.67   0.00   0.67   1.64   2.33
Wald         0.00   1.07    -0.02      0.47     -0.69   0.00   0.69   1.74   2.54
SS           0.00   1.01     0.00     -0.14     -0.69   0.00   0.69   1.66   2.29
SLR          0.00   1.02     0.00     -0.02     -0.69   0.00   0.69   1.69   2.36
LRA          0.00   0.99     0.00     -0.02     -0.68  -0.01   0.67   1.63   2.30

Table 2 provides the first four moments (mean, SD, skewness, and kurtosis⁷) and five
percentiles (25th, median, 75th, 95th, and 99th) of the standard normal distribution and
the distributions of the four statistics for all the examinees under the "no change" condition
and test length 30. Table 2 shows that the summary statistics for the Wald statistic are the
farthest from those of the $N(0,1)$ distribution and those of the LRA are overall closest to
those of the $N(0,1)$ distribution.
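
Summaries of this kind can be computed along the following lines. This is a sketch only: it uses population (divide-by-n) moments, subtracts 3 from the kurtosis so that the standard normal value is 0, and computes percentiles by linear interpolation between order statistics. The paper does not say which of the several common conventions were used, so these choices are assumptions, as are the function names.

```python
import math

def moment_summary(values):
    """Mean, SD, skewness, and excess kurtosis (kurtosis minus 3) using divide-by-n moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    sd = math.sqrt(m2)
    return {"mean": mean, "sd": sd,
            "skewness": m3 / sd ** 3,
            "kurtosis": m4 / m2 ** 2 - 3.0}

def percentile(values, pct):
    """Percentile by linear interpolation between adjacent order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)
```

With several hundred thousand simulated examinees per condition, the differences between the competing moment and percentile conventions would be negligible at the two decimal places reported in Table 2.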

⁷ Note that 3 has been subtracted from the formula of kurtosis so that the kurtosis of the standard normal
distribution is 0 according to the formula used in this paper.

Figure 2. The Type I error rate and power for the four statistics for Simulation 1.

Figure 2 shows the average Type I error rate (three panels on the left) and power
(three panels on the right) for different test lengths of the Wald statistic, the SS statistic,
the SLR statistic, and the LRA, averaged over all the true values of $\theta_1$ and $\theta_2$. The top two,
middle two, and bottom two panels show the results for significance levels of 0.001, 0.01,
and 0.05, respectively. The Type I error rate of a statistic was computed as the proportion
of statistically significant values of the statistic among the examinees with true $\theta_1$ equal to
true $\theta_2$. The power of a statistic was computed as the proportion of statistically significant
values of the statistic among the examinees with true $\theta_2$ larger than true $\theta_1$. Note that
the smallest significance level (0.001) that is considered in the figure is important because
some of the applications of the test of equality of abilities involve detection of cheating on
assessments where very small significance levels are often recommended (e.g., Wollack et al.,
2015) to avoid potential adverse consequences of a false detection. In any panel of Figure 2,
the test length is shown along the X-axis and the average Type I error rate or power for
each test length is shown along the Y-axis. The values for the Wald statistic, SS statistic,
SLR statistic, and LRA are shown using circles, triangles, cross signs, and diamond signs,
respectively. A horizontal dotted line in each of the three panels on the left indicates the
nominal level. Although the range of the vertical axis varies over the three panels on the
left, it is the same in the three panels on the right.
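
The Type I error rate and the power are both proportions of significant results, computed over different groups of simulated examinees. A minimal sketch for a statistic that is referred to the standard normal distribution with a one-sided (upper-tail) alternative follows; the function name is hypothetical.

```python
from statistics import NormalDist

def rejection_rate(stat_values, alpha):
    """Proportion of statistic values exceeding the upper-tail standard normal critical value."""
    critical = NormalDist().inv_cdf(1.0 - alpha)
    return sum(1 for s in stat_values if s > critical) / len(stat_values)
```

Applied to examinees simulated with equal abilities, this proportion estimates the Type I error rate; applied to examinees simulated with a positive change, it estimates power. For the LRA, which is itself a probability, the analogous rule under this sketch would be to count the values smaller than the significance level.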
Figure 3. How Type I Error Rates Vary Over the True Abilities for the Four Statistics.


The four panels of Figure 3 show the Type I error rates of the Wald statistic, the SS
statistic, the SLR statistic, and the LRA for different values of $\theta_1 = \theta_2$ for the significance
level of 0.01. A dotted horizontal line in each panel shows the nominal level.


Figures 2 and 3 show that among the statistics, the LRA has the most satisfactory
Type I error rates overall. The Type I error rates for the LRA are slightly smaller than or
equal to the nominal level for all combinations of test length and significance level. The
Type I error rate of the Wald statistic is severely inflated at the level 0.001 and inflated at
the other levels. This is expected from the top left panel of Figure 1. The Wald statistic
was found to have inflated Type I error rate by Guo and Drasgow (2010) as well. The
Type I error rates of the SS and SLR statistics are slightly inflated in some cases such as
the bottom right panel of Figure 3. As test length increases, the Type I error rate of each
statistic converges to the nominal level in all the three panels on the left of Figure 2,⁸ but
the rate for the Wald statistic is larger than the nominal level even for the test length of 50.
For the test length of 50, the Type I error rate of the LRA is identical to the significance
level in each of the three panels on the left of Figure 2 and in the bottom right panel of
Figure 3 while that of the SS and SLR statistics is slightly larger than the nominal level in
the two bottom panels on the left of Figure 2 and in the bottom right panel of Figure 3.

Figure 2 shows that the power of each statistic increases steadily as test length 
increases. The power of the Wald statistic is the largest for all combinations of test length 
and significance level. The values of power of the other three statistics including the LRA 
are very close to one another. While Figure 2 shows the power after averaging over
the three possible values of the ability difference, a separate analysis (whose results are not
shown here and can be obtained from the authors upon request) revealed that the power 
of each statistic increases as the difference in ability increases. For example, for test length 
50 and significance level of 0.05, the power of each statistic is close to 0.3, 0.7, and 0.9, 
respectively, when the ability difference is 0.5, 1.0, and 1.5. 

Overall, it seems that the LRA achieves the nominal Type I error rate without losing
too much power in comparison to the Wald, SLR, or SS statistics. Also, as test length
increases, the Type I error rates of the other statistics converge to slightly above the
nominal level in some cases, but that of the LRA converges to the exact significance level.

⁸ That is expected given that the null distribution of all these statistics converges to the standard normal
distribution as test length increases.


Simulation 2: Measurement of Change Using Polytomous Items 


A simulation like the earlier one was performed to compare the performances of the 
statistics for testing $H_0: \theta_1 = \theta_2$ versus $H_1: \theta_2 > \theta_1$ in the context of measurement of
change using polytomous items. Two non-adaptive assessments, each with 5, 10, 20, or 40
five-category polytomous items, were used as the assessments administered at time points 1
and 2. As in the earlier simulation, the true value of either of $\theta_1$ or $\theta_2$ was considered to be
equal to one among -2, -1.5, ..., 1.5, 2. Item-scores of 100,000 examinees were simulated for
each possible combination of true $\theta_1$ and true $\theta_2$ where $\theta_1 \le \theta_2 \le \min(\theta_1 + 1.5, 2.0)$.

The GPCM (Muraki, 1992) was used in the analysis. The sets of true item parameters 
for the two assessments were randomly drawn from a set of estimated item parameters from 
a data set from the NEO Personality Inventory that is considered in the second real data 
example in this paper. The MLE of ability, restricted to the range between -4.0 and 4.0, was used in
the computations. 

Figure 4 shows the average Type I error rate (the three panels on the left) and 
power (the three panels on the right) for different test lengths of the four statistics, averaged 
over all the true values of $\theta_1$ and $\theta_2$, at three significance levels: 0.001, 0.01, and 0.05.

The LRA has the most satisfactory Type I error rates (that are very close to the 
nominal level) followed by the SLR statistic in Figure 4. The Wald statistic, again, has 
inflated Type I error rates. The SS statistic occasionally has slightly inflated Type I error 
rates (for example, for test lengths of 5 and 10 in the bottom left panel). As in Simulation
1, the power of the Wald statistic is the largest for all cases and the values of power of the
other three statistics are very close to one another.


Simulation 3: Detection of Item Preknowledge 

A simulation somewhat similar to that in Sinharay (2017) was performed to compare
the performance of the statistics in the context of detecting item preknowledge when a
known set of items has been compromised. In these simulations, a set of items was assumed
to have been compromised. It was assumed that the null hypothesis is that the ability over
the non-compromised items ($\theta_1$) is equal to that over the compromised items ($\theta_2$), that is,
$H_0: \theta_1 = \theta_2$; the alternative hypothesis was $H_1: \theta_2 > \theta_1$.

Figure 4. The Type I error rate and power for the four statistics for Simulation 2.

A non-adaptive assessment with 100 dichotomous items was used as the whole 
assessment. The true values of the ability of those who did not benefit from item 
preknowledge (non-cheaters) were simulated from a standard normal distribution. The 
true values of the ability of those who benefited from item preknowledge (cheaters) were
simulated from a standard normal distribution or a $N(-0.5, 1)$ distribution; the first
distribution represents the case when the cheaters are the same as the non-cheaters in
ability and the second represents the case when the cheaters have lower ability on average 
than the non-cheaters. The set of compromised items was assumed to be a subset of size 
10, 20, or 30 of the whole assessment; thus, the set of non-compromised items included the 
remaining 90, 80, or 70 items of the assessment. The number of cheaters was assumed
to be 0.05, 0.10, or 0.20 times the number of non-cheaters. It was assumed that for each
simulation condition, the number of non-cheaters was 10,000; this means that the number 
of cheaters was 500, 1000, or 2000 in the various simulation conditions. Thus, 18 simulation 
conditions (involving all combinations of two ability distributions of the cheaters, three 
sizes of the set of compromised items, and three proportions of cheaters) were used. 

The 2PLM was used in the analysis. The item-scores of all examinees on the 
non-compromised items and the item-scores of the non-cheaters on the compromised items 
were simulated from the 2PLM. It was assumed, as in Sinharay (2017), that if an examinee 
has preknowledge of an item, his/her probability of a correct answer on the item was 0.90; 
therefore, the item-scores of the cheaters on the compromised items were simulated as 
draws from a Bernoulli distribution with a success probability of 0.90. 
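
The data-generating scheme described above can be sketched as follows. This is an illustration rather than the simulation code used in the paper: the function names, the use of Python's `random` module, and the item parameters in the usage note are hypothetical, and only the preknowledge probability of 0.90 and the 2 × 3 × 3 design come from the text.

```python
import itertools
import math
import random

def prob_correct(theta, a, b):
    """2PLM probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def simulate_scores(theta, items, compromised, is_cheater, rng, p_known=0.90):
    """Simulate 0/1 scores; cheaters answer compromised items correctly with probability p_known."""
    scores = []
    for index, (a, b) in enumerate(items):
        if is_cheater and index in compromised:
            p = p_known
        else:
            p = prob_correct(theta, a, b)
        scores.append(1 if rng.random() < p else 0)
    return scores

# Cheater ability mean x number of compromised items x ratio of cheaters to non-cheaters.
conditions = list(itertools.product((0.0, -0.5), (10, 20, 30), (0.05, 0.10, 0.20)))
print(len(conditions))
```

For one condition, a cheater's true ability could be drawn as `rng.gauss(-0.5, 1.0)` and passed to `simulate_scores` with `compromised=set(range(10))`; which item positions are treated as compromised is arbitrary in this sketch.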

The true item parameters were randomly drawn from the set of estimated item 
parameters from a real data set that is discussed later in this paper. The MLE of ability, 
restricted to the range between -4.0 and 4.0, was used in the computations. For each examinee, the
true ability and the item scores were simulated (where the simulating distributions depend 
on whether the examinee cheated or not), the ability estimate was separately computed on 
the compromised items, non-compromised items, and all items, and the Wald test statistic, 
the SLR statistic, the SS statistic, and the LRA were computed. 

Table 3 shows the Type I error rates and power of the four statistics, averaged over all 
the simulation conditions, at three significance levels: 0.001, 0.01, and 0.05. The Type I 
error rate of a statistic was computed as the proportion of statistically significant values 
of the statistic among the non-cheaters. The power of a statistic was computed as the 
proportion of statistically significant values of the statistic among the cheaters. 


Table 3. The Type I error rate and power for the four statistics for Simulation 3.

Significance          Type I Error Rate                      Power
Level           Wald     SS      SLR     LRA       Wald    SS     SLR    LRA
0.001          .0233   .0005   .0012   .0007        a     ele.    ale    .19
0.01           .0340   .0066   .0141   .0096       .45    .31    .40    .36
0.05           .0727   .0382   .0620   .0500       .58    .50    .56    .54

Table 3 shows that the Type I error rates for the LRA are very close to the nominal
level at levels 0.01 and 0.05 and slightly conservative at the level of 0.001. The Type I error
rates of the Wald statistic are severely inflated at all the levels. Guo and Drasgow (2010) 
found the Wald statistic to often have inflated Type I error rates in the context of detection 
of cheating on unproctored tests. The Type I error rates of the SLR statistic are slightly 
inflated at all the levels. The Type I error rates of the SS statistic are smaller than the 
nominal level in all cases. The power of the Wald statistic is the largest for all significance
levels and that of the SS statistic is the smallest for all levels. The power of the LRA is the
second smallest for all three levels, although it is within 0.04 of that of the SLR statistic.
Overall, it seems that as in the earlier simulations, the LRA achieves the nominal Type I 
error rate without losing too much power in comparison to the existing statistics. 

Figure 5 shows the average Type I error rates and power of the Wald, SS, and SLR 
statistics and the LRA for different numbers of compromised items (10, 20, or 30). The
three panels on the left show the Type I error rates and the three panels on the right show 
power. Figure 5 shows that the Type I error rate of the Wald statistic decreases to the 
nominal level as the number of compromised items increases, but is considerably larger than 
the nominal level even for 30 compromised items. The figure also shows that the Type I 
error rate of the SLR statistic decreases to the nominal level as the number of compromised 
items increases, but is slightly larger than the nominal level even for 30 compromised items 
for levels of 0.01 and 0.05. In contrast, the Type I error rate for the LRA is very close 
to the nominal level in all cases. This result provides empirical evidence of a much faster
convergence of the LRA to the true p-value compared to the Wald statistic and a slightly 
faster convergence of the LRA to the true p-value compared to the SLR and SS statistics. 


[Figure 5: six panels. Left column: Type I Error Rate at Level = 0.001, 0.01, and 0.05; right column: Power at Level = 0.001, 0.01, and 0.05. X-axis in each panel: Number of Compromised Items (10, 20, 30).]

Figure 5. Type I Error Rate and Power for Different Number of Compromised Items. 


At a given level, the power of all statistics increases with an increase in the number of
compromised items.


Real Data Examples 
Example 1: Detection of Item Preknowledge 
Let us consider item-response data from one form of a non-adaptive licensure 
assessment. The data set was analyzed in several chapters of Cizek and Wollack (2017). 
The form includes 170 operational items that are dichotomously scored. Item scores were 
available for 1,644 examinees for the form. The licensure organization that provided the
data identified 61 items on the form as compromised. The organization also flagged 48
individuals on the form as possible cheaters based on a variety of statistical analyses and a
rigorous investigative process that brought in other information; given the rigor of the
investigative process, these examinees will be treated as true cheaters.


Table 4. The Proportion Significant for the Various Statistics for the Real Data Set.

                               Significance Level
Group          Statistic     .001     .01     .05
Not flagged    Wald          .014    .038    .084
Not flagged    SS            .006    .033    .079
Not flagged    SLR           .005    .031    .078
Not flagged    LRA           .004    .028    .074

Flagged        Wald          .125    .271    .312
Flagged        SS            .125    .188    .312
Flagged        SLR           .125    .188    .312
Flagged        LRA           .125    .188    .312


The values of the Wald statistic, SS statistic, and the SLR statistic were computed 
for each individual in the data set. The LRA was also used to compute a p-value for each 
examinee. The set of 109 non-compromised items was considered as the first set of items
and the set of 61 compromised items was considered as the second set of items. The null
hypothesis of equal abilities over the two sets of items ($H_0: \theta_1 = \theta_2$) corresponds to no
cheating while the alternative hypothesis corresponds to potential cheating ($H_1: \theta_2 > \theta_1$).
The Rasch model is operationally used in the assessment; however, the 2PLM was found to
fit the data better and was used for the analysis. The item parameters were estimated from
the data set using the marginal maximum likelihood estimation procedure and used in the
computation of the statistics. The MLEs of the abilities, restricted to the range between
-4.0 and 4.0, were used to compute the statistics.

The proportions of examinees for whom the statistics were significant at significance
levels of 0.001, 0.01, and 0.05 are provided in Table 4. The first four rows of Table 4 include
the proportions significant among the examinees who were not flagged by the licensure
organization. The last four rows of the table include the proportions significant only among
the 48 examinees who were flagged by the licensure organization; thus, for example, the
proportion 0.125 for the Wald statistic at level=0.001 in the fifth row of numbers implies
that among the 48 examinees flagged by the licensure organization, the Wald statistic was
significant at level=0.001 for six examinees (note that 6/48=0.125).
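
The same arithmetic can be applied to the other proportions reported for the 48 flagged examinees. Only the count of six is stated in the text; the remaining counts produced by this short sketch are inferred from the rounded proportions in Table 4 and should be read as such. The function name is hypothetical.

```python
def implied_count(proportion, group_size):
    """Nearest whole number of examinees consistent with a reported proportion."""
    return round(proportion * group_size)

flagged = 48
wald_counts = [implied_count(p, flagged) for p in (0.125, 0.271, 0.312)]
lra_counts = [implied_count(p, flagged) for p in (0.125, 0.188, 0.312)]
print(wald_counts, lra_counts)
```

Because one examinee out of 48 corresponds to a step of about 0.021, each reported proportion is compatible with only one whole-number count.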

Table 4 shows that the proportion of significant values for the Wald statistic is larger 
than or equal to that for the other statistics in all cases, which is in agreement with the 
simulation studies. Table 4 also shows that the proportions of significant values are close 
for the SLR and SS statistics and LRA in all cases; among these three statistics, the LRA 
leads to a smaller percentage of significant values than the SLR and SS statistics among the 
non-flagged examinees and leads to an equal percentage of significant values as the SLR 
and SS statistics among the flagged examinees. 

Table 4 also shows that the proportion significant for each statistic is much larger 
among the examinees flagged by the licensure organization (bottom four rows of the table) 
than among those not flagged (top four rows of the table). This result provides some
evidence that the statistics are somewhat successful: they are significant at a larger rate
among the examinees who are true cheaters.


Example 2: Comparison of Performance Over Two Subtests 


Let us consider a data set from the NEO Personality Inventory that was analyzed by Glas
and Dagohoy (2007). The NEO Personality Inventory is a personality test designed to 
provide a general description of normal personality that is relevant to clinical, counseling, 
and educational situations. The inventory is based on the five-factor model of personality 
(Costa & McCrae, 1992) and consists of five broad domains. For each domain, six facet 
scores have been developed to provide specific levels of information. Each facet is measured 
by eight items each of which is rated on a five-point scale. Data from 1,168 individuals 
on the neuroticism domain was analyzed in Glas and Dagohoy (2007) who split the 48 
items on the domain into three sub-tests so that Items 1-8 and 9-16 in each sub-test relate 
to different facets; they also found the data within each sub-test to be unidimensional. 
The unidimensional GPCM was separately fitted to each sub-test and the estimated item
parameters were used to test the null hypothesis that the examinee ability is the same over
Items 1-8 and 9-16 (that is, the same over the two facets under the sub-test) against a one-sided
alternative hypothesis. The MLE of examinee ability, restricted between -4 and 4, was used
in the computations.


Table 5. The Proportion Significant for the Statistics for the Second Real Data Example.

Sub-test    Wald     SS     SLR    LRA
1           .138    .081   .119   .118
2           .132    .091   .125   .124
3           .129    .084   .110   .109


Table 5 shows the proportions of statistically significant p-values at the 5% level for the
three sub-tests for the Wald statistic, SS statistic,⁹ SLR statistic, and LRA. The proportions
are largest for the Wald statistic followed by the SLR statistic. The proportions for the 
SLR statistic and LRA are very close. 


Conclusions 


Hypothesis-testing approaches based on higher-order asymptotics (Barndorff-Nielsen & 
Cox, 1994; Ghosh, 1994) were applied to the problem of testing whether the ability of
an examinee is the same over two sets of items. Such problems arise in various contexts in 
educational and psychological measurement including measurement of change (e.g., Fischer, 
2003) and detection of test cheating (e.g., Guo & Drasgow, 2010; Wollack & Schoenig, 
2018). The modified signed likelihood ratio test (MSLRT; Barndorff-Nielsen, 1986) and 
the Lugannani-Rice approximation (LRA; Lugannani & Rice, 1980) were found to perform 
better than the signed likelihood ratio (SLR) statistic, the signed score (SS) statistic (e.g., 
Cox & Hinkley, 1974; Sinharay, 2017), and the Wald statistic (e.g., Cox & Hinkley, 1974; 
Finkelman et al., 2010). In the simulations, the MSLRT and LRA led to Type I error rates
that are quite close to the nominal level, even when these tests are based on a few items,
and not larger than the nominal level in general. This result is encouraging because false
positives in the context of detection of test cheating may have dire consequences (e.g.,
Skorupski & Wainer, 2017) and should be minimized (e.g., Ferrara, 2017), and the MSLRT
and LRA would lead to the fewest false positives among the statistics considered here. In
particular, the satisfactory Type I error rate of the LRA for small significance levels is
promising because of the typical use of conservative significance levels (such as 0.001) in
detection of test cheating (e.g., Wollack et al., 2015).

⁹ The SS statistic in this case is a signed square root of the statistic that Glas and Dagohoy (2007) used
to test against a two-sided alternative.

The suggested statistics can be considered as person-fit statistics in the same way that 
the Lagrangian multiplier test statistic of Glas and Dagohoy (2007) or the three statistics
of Klauer and Rettig (1990) are person-fit statistics—the suggested statistics are computed 
for each individual examinee and they can be used to detect one specific type of person 
misfit—one characterized by a difference in performance over two sets of items. 

The choice of the significance level to be used with the suggested statistics is an 
important issue. Wollack and Eckerly (2017, p. 227) used the significance level of 0.001 in 
their real data example to limit the number of false positives and commented that states or 
test sponsors would apply a conservative criterion in practice. Another option to limit the 
number of false positives is to choose a critical value that adjusts for multiple comparisons 
by controlling the family-wise error rate using a Bonferroni correction or by controlling the
false discovery rate using the procedure of Benjamini and Hochberg (1995). 
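
The two adjustments mentioned above can be sketched in a few lines of Python. The sketch assumes that a one-sided p-value is available for each examinee; the function names are hypothetical, and the step-up rule is the standard Benjamini-Hochberg procedure rather than anything specific to this paper.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Flag p-values no larger than alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, fdr=0.05):
    """Flag p-values using the Benjamini-Hochberg step-up rule at the given false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * fdr / m:
            cutoff_rank = rank          # largest rank meeting its threshold
    reject = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[index] = True
    return reject
```

With the eight p-values 0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, and 0.205 and a level of 0.05, the Bonferroni rule flags only the first, while the Benjamini-Hochberg rule flags the first two.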

The suggested approaches were derived for the 2PLM and the GPCM. They can also 
be applied to the dichotomous or polytomous Rasch models that are special cases of the 
2PLM and GPCM, respectively. Unfortunately, the likelihood distribution of the ability for 
the three-parameter logistic model (3PLM) does not belong to the exponential family of 
distributions (e.g., Biehler et al., 2015)—so the methods in this paper do not apply to the 
3PLM. Researchers such as Skovgaard (1990) suggested approaches to apply the MSLRT 
and LRA to distributions that do not belong to the exponential family of distributions, but 
the application of those approaches does not seem to be straightforward to the 3PLM. While 
application of the MSLRT and the LRA to the 3PLM remains a topic for further research, 
the suggested approaches (that can be used in applications of Rasch models, 2PLM, and
GPCM) should be useful given that (i) the Rasch models, 2PLM, and GPCM are widely
used, (ii) Haberman (2006) found the 3PLM to not provide much gain for real data over
the 2PLM,¹⁰ and (iii) researchers such as Maris and Bechger (2009) and Martin, Gonzalez, and
Tuerlinckx (2015) have recently unearthed some problems with the identifiability of the
3PLM.

This paper has several additional limitations and hence leaves scope for further related 
research. First, simulations covering more cases of testing of the equality of abilities can be
performed to further explore the performance of the MSLRT and the LRA. Similarly, more 
real data applications might provide deeper understanding of the approaches. Second, only 
a one-sided alternative hypothesis was considered—it is possible to explore two-sided 
alternatives; the MSLRT and LRA would then involve two one-sided tests while the 
traditional approaches would be the LRT and the square of the Wald or score statistic. 
Some limited examination shows that the suggested methods perform slightly better 
compared to existing tests when two-sided alternatives are of interest. Third, hypothesis
tests involving a unidimensional ability parameter were considered; tests involving
multidimensional ability parameters would be an obvious next step. Fourth, because the
approaches considered in this paper are based on item-scores of one examinee at a time, 
their power is expected to be low, as seen from the simulation studies; in contrast, methods 
such as matching analysis (Haberman & Lee, 2017) or aggregate-level erasure analysis 
(Wollack & Eckerly, 2017) would have larger power because those methods are based on 
the whole sample or a group of examinees; still, the approaches suggested in this paper
promise to be helpful because hypothesis tests based on item-scores of one examinee at a
time are routinely performed by researchers and testing organizations.
Appendix: The MSLRT and the LRA for the GPCM


Donoghue (1994) provided the result that for log likelihood given by Equation 21,

Equation 25 can be rewritten as

For two sets of items, using notations introduced below Equation 22, the joint log
likelihood of $\theta_1$ and $\theta_2$ is given by

The last equality holds because $\sum_{k=0}^{m_i} d_k(X_i) = \sum_{k=0}^{m_j} d_k(Y_j) = 1$ under the assumption that
no data are missing, which means, for example, that $\sum_{k=0}^{m_i} d_k(X_i) \log(T_i(\theta_1)) = \log(T_i(\theta_1))$.
Let us apply the transformations $\psi = \theta_2 - \theta_1$ and $\lambda = \theta_1$, which means that $\theta_1 = \lambda$ and
$\theta_2 = \psi + \lambda$. Let us also denote


$$S_1 = \sum_{i=1}^{n_1} a_i \sum_{k=0}^{m_i} (k+1) \, d_k(X_i), \quad \text{and} \quad S_2 = \sum_{j=1}^{n_2} a_j \sum_{k=0}^{m_j} (k+1) \, d_k(Y_j).$$

Note that both $S_1$ and $S_2$ are functions of the data ($X_i$'s and $Y_j$'s) and not of the
parameters ($\theta_1$ and $\theta_2$). The above log-likelihood, $\ell(\theta_1, \theta_2)$, or, $\ell(\psi, \lambda)$, then is given by

$$\ell(\psi, \lambda) = \sum_{i=1}^{n_1} \sum_{k=0}^{m_i} d_k(X_i) \log P_{ik}(\lambda) + \sum_{j=1}^{n_2} \sum_{k=0}^{m_j} d_k(Y_j) \log P_{jk}(\psi + \lambda) \qquad (28)$$

The above log-likelihood belongs to the exponential family of distributions with canonical
parameters $\psi$ and $\lambda$ and joint sufficient statistics $S_2$ and $(S_1 + S_2)$.

Then, given the discussion on the applicability of the MSLRT and LRA to the exponential
family of distributions, the MSLRT and LRA can be applied to test $H_0: \psi = 0$, or,
$H_0: \theta_1 = \theta_2$, in applications of the GPCM. The SLR statistic is given by

Then, using the result provided in Equation 26 (or, by differentiating the joint log
likelihood provided in Equation 29 twice),

Then,

One can obtain $j_{\lambda\lambda}(\psi_0, \hat{\lambda}_0)$ from Equation 31 as

Once $q(\psi_0)$ is computed using Equation 34 and $r(\psi_0)$ is computed using Equation 30, one
can compute the MSLR statistic $r^*(\psi_0)$ using Equation 12, and can compute the LRA
using Equation 14.